Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from ray-project:master #56

Merged
merged 144 commits into from
Nov 9, 2024

Conversation

pull[bot]
Copy link

@pull pull bot commented Oct 29, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

simonsays1980 and others added 30 commits October 27, 2024 21:18
…ng information about individual connector pieces to the `ctor_args_and_kwargs` file). (#48213)
…mum_gcs_destroyed_actor_cached_count (#48229)

Signed-off-by: dayshah <[email protected]>
…e it is iterating with each iterable value. (#48270)

Signed-off-by: win5923 <[email protected]>
/salute

Signed-off-by: Edward Oakes <[email protected]>
…t level" (#48292)

Reverts #48223

Broke #48289 so revert to
unblock the release.
## Why are these changes needed?

Refactor handle options.
- Change `_options` and `copy_and_update` to use `**kwargs` instead of
listing out the entire set of handle options, so that in the future
adding new options only requires code change in two locations: (1) the
public facing `options()` API, and `_HandleOptions` itself.
- Separate handle options into "dynamic" and "init" options. Dynamic
options can be modified in a lightweight manner with `.options()`, while
init options can only be modified once, before a handle is initialized,
with `.init()`.
- Initializing a handle is equivalent to instantiating the
lazily-created router.
  - `.options()` and `.remote()` implicitly initialize a handle.


---------

Signed-off-by: Cindy Zhang <[email protected]>
….4, pettingzoo 1.24.3 supersuit 3.9.3)." (#48297)

Reverts #45328
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Modifies the `route` attribute used for logs and metrics to limit its
cardinality (see linked issue).

After this change, the behavior will be:

- If the user is not using FastAPI, the `route` will always be the
`route_prefix`.
- If the user is using FastAPI, the `route` will be the `route_prefix` +
the matched route string.

Take the following example:
```
@serve.deployment
@serve.ingress(fastapi_app)
class A:
    @fastapi_app.get("/{user_id}")
    def dynamic_subpath(self, user_id: str) -> str:
        return ...

serve.run(A.bind(), route_prefix="/prefix")
```

Previously, this could have infinite cardinality in the metrics tags (a
unique tag value for each user). With this change, the tag would instead
be: `"/prefix/{user_id}"`.

The "not found" response has also been modified to *not* include a
`route` tag.

TODO:

- [x] Add unit tests for `get_asgi_route_name`
- [x] Add integration tests to `test_metrics.py`
- [x] Add tests for the replica request context

## Related issue number

Closes #47999

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Edward Oakes <[email protected]>
…ING_OR_CONCURRENCY TaskStatus (#48242)

Signed-off-by: Jiajun Yao <[email protected]>
…eployment level" (#48292)" (#48294)

This reverts commit 703194a.

The original commit caused `test_deploy_2.py` to become flaky. It
appears this was a timing issue due to the 10s timeout set on the test.
I've removed this, as it looks like a very old test case that is no
longer relevant.

---------

Signed-off-by: Edward Oakes <[email protected]>
… work on head node (#48103)

This PR updates the compute configs for benchmark release tests to not
schedule workers onto the head node. This reflects the best practice not
to schedule heavy work on the head node for cluster stability.

---------

Signed-off-by: Justin Yu <[email protected]>
## Why are these changes needed?

depends on #48232

Pull out some router init parameters and router init logic into
`create_router`.

---------

Signed-off-by: Cindy Zhang <[email protected]>
## Why are these changes needed?

Move `test_deployment_scheduler.py` from small to medium.


Signed-off-by: Cindy Zhang <[email protected]>
If args is not a list, an exception will be thrown in the above if statement.
## Why are these changes needed?

Implement full recursive cancellation at Serve layer.

Currently, at the time that a parent request is cancelled, if:
- The child request has not been assigned to a replica, _meaning_ the
replica scheduler in the router is still processing that request
- AND the parent request is not directly blocking and waiting on the
result of the child request

Then the child request will not be correctly cancelled. Note that if the
second bullet point were actually flipped, i.e. the parent request _was_
waiting on the result at the time of cancellation, then an
`asyncio.CancelledError` will be sent directly to the replica scheduler
and the child request will get successfully cancelled. However obviously
this cannot be guaranteed to be true.

This PR adds a cache for asyncio tasks that are launched in the router
to assign a request to a replica. When a parent request is cancelled, if
there are any "pending assignment requests" in the cache, they are
cancelled.



Signed-off-by: Cindy Zhang <[email protected]>
Stacked on: #48223

Reimplements the `.bind()` codepath to avoid using the Ray DAG codepath
which adds a lot of complexity for little benefit.

The DAG traversal code is much simpler, around 100 lines of
self-contained code in `build_app.py`. I've also added unit tests for
it.

There should be no behavior change here.

---------

Signed-off-by: Edward Oakes <[email protected]>
…joco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3)."" (#48308)
Previously the error message was injected directly as the HTTP status
message

Signed-off-by: Richo Healey <[email protected]>
## Why are these changes needed?

fix router.py

Signed-off-by: Cindy Zhang <[email protected]>
…g compilation (#47757)

Leaf nodes are nodes that are not output nodes and have no downstream nodes. If a leaf node raises an exception, it will not be propagated to the driver. Therefore, this PR raises an exception if a leaf node is found during compilation.

Another solution: implicitly add leaf node to MultiOutputNode
Currently, the function execute can return multiple CompiledDAGRefs. The UX we want to provide is to implicitly add leaf nodes to the MultiOutputNode but not return the references of the leaf nodes. For example, a MultiOutputNode is containing 3 DAG nodes (2 normal DAG nodes + 1 leaf node).

x, y = compiled_dag.execute(input_vals) # We don't return the ref for the leaf node.
However, the ref for leaf node will be GC(ed) in execute, and CompiledDAGRef’s del will call get if it was never called which makes execute to be a sync instead of an async operation which is not acceptable.
…#48358)

Some tests are failing when we duplicate the original release tests. Marking them as unstable to be non-blocking.
Will mark as stable once they are truly stable.

Signed-off-by: Rui Qiao <[email protected]>
GcsServer manages several IO Contexts and their threads, and hard codes
which class uses which IO Context. This is not flexible in that we can't
easily declare another IO Context and adapt some classes without a
pervasive code change.

Introduces an IoContextProvider that manages IO Contexts based on a
static, compile-time Policy, whose author can determine class-IO Context
mapping. This way we can change the mappings in a central place without
big chunk of changes everywhere.

Signed-off-by: Ruiyang Wang <[email protected]>
ruisearch42 and others added 29 commits November 6, 2024 19:53
To make the visualization more concise and easier to read:
Show nodes in concise form and show actor information in the subgraph instead.
…U loader thread, enable local learner w/ GPU. (#48314)
and remove linux wheel testing logic. linux is not using this script
anymore.

Signed-off-by: Lonnie Liu <[email protected]>
The current graph shows memory usage of each node along side the MAX memory across the cluster.
For oom debugging, we care more about memory utilization (percentage) on each node.
…ing tasks (#48528)

Rename "Active tasks by name" to "Requested Live tasks". Make it clear that they do not only refer to running tasks but any tasks that has been requested to be scheduled.

Add a graph for "Running tasks by name" that only includes "Running_*" status

Rename "Active actors by name"
…type of container as a DAGNode arg (#48302)

Based on #48045, we have decided to treat a case where a DAGNode is inside any type of container as a DAGNode argument as an error and throw an exception.
In a recent investigation, we saw `Broken pipe` issue when a node is
being preempted and going through the shutdown process.

This is issue is due to a race condition:
* The node is being preempted and is being gracefully shut down.
* At the same time, remote hosts receives the notifications that the
node state changes and send `FreeObjects` requests to the node
* When handling the `FreeObjects` request, the Raylet will call `Delete`
on the local Object store.
* However, when the `Delete` happens, the object store has already been
stopped and thus caused the Broken Pipe issue.

The fix in the code is to move the logic to shutdown GRPC server before
stopping the object store. In this sense, we can make sure when stopping
the Object Store, there will not be on-going or future GRPC requests.

Signed-off-by: Mengjin Yan <[email protected]>
Follow up to comment in #48609
that came in after the PR merged. cc: @pcmoritz

Signed-off-by: angelinalg <[email protected]>
Signed-off-by: angelinalg <[email protected]>
Two issues:
1. When the execution plan caches the dataset schema, it might call
`unify_schema` to produce a single schema from all of the bundles'
schemas. The issue is that this function expects all of the input
schemas to be of Arrow type, but we only check if at least one schema is
of Arrow type before calling the function.
2. `to_pandas` iterates over blocks and adds them to
`DelegatingBlockBuilder`. The issue is that `DelegatingBlockBuilder`
expects all input blocks to be of the same type.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
This PR adds docs to the `dev-workflow.md` complementing the changes in
this PR: #48477

---------

Signed-off-by: Promptless Bot <[email protected]>
Signed-off-by: Frances Liu <[email protected]>
Signed-off-by: frances720 <[email protected]>
Co-authored-by: promptless[bot] <179508745+promptless[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
## Why are these changes needed?

Add Project operator to select_columns.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Richard Liaw <[email protected]>
This PR integrates `uv` into existing runtime env system, which uses
`uv` to setup environment for better performance.

TODO list:
- [ ] Add `uv` to public documentation

---------

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
…with custom DNS for Kubernetes API. (#48541)

Signed-off-by: ltbringer <[email protected]>
and fix windows build

Signed-off-by: Lonnie Liu <[email protected]>
In #47958 , visualization unit tests were introduced but would not run if CI environment does not have graphvis installed (which was the case).

This PR adds the dependency in test-requirements.txt and always run the visualization tests. It also moves visualization tests to a separate file since test_accelerated_dag.py is too large.
## Why are these changes needed?

Refactor resolve request arg func.

Separate into:
1. actual type checking + argument resolution (injected into the Router
as a function)
2. the logic for executing that resolution and replacing args/kwargs
(kept as part of Router)

---------

Signed-off-by: Cindy Zhang <[email protected]>
Now, when a Ray job is submitted, JobHead starts a Ray actor namely
`JobSupervisor` which fork-and-exec a subprocess with the entrypoint. It
bears all environment variables from the JobSupervisor, with some
additional ones via `_get_driver_env_vars`. However we also want to
*remove* some env vars, specifically the Ray internal ones including
`RAY_JOB_ID` and `RAY_RAYLET_ID` because they are meant for the
JobSupervisor, not for the job we submit. For example, in the entrypoint
`$RAY_JOB_ID` can be set to `01000000` because that's the job id of the
JobSupervisor, while the entrypoint really is for job 08000000.

This did not show up as a bug because these environment variables are
not read by the driver core workers. However we find a user pattern that
they collect all env vars in the driver process and pass it to
tasks/actors via `env_vars`. If they do so, the started workers get
confused by the stale job ids and crash.

Removes those env vars on JobSupervisor for-and-exec to hide them from
the user entrypoint.

Next steps: we should check for internal flags in `env_vars` and raise
Exceptions for internal flags; and/or we can have an option to
auto-remove them for the users. something like:

```py
@ray.remote(runtime_env={
    "env_vars":os.environ, 
    # can be "raise", "allow_debugonly", "remove", default to "raise"
    "env_vars_config":{"internal_flags": "remove"},
})
```

---------

Signed-off-by: Ruiyang Wang <[email protected]>
#47752 introduced a minor, silent
issue in test_map.
I stumbled across it because a linter I'm using complained.
## Why are these changes needed?

Match for the more specific error that is raised, in this case
`TypeError`, instead of matching for generic `RayTaskError`.

---------

Signed-off-by: Cindy Zhang <[email protected]>
@Superskyyy Superskyyy merged commit 1970e6a into Superskyyy:master Nov 9, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.